Linguistically-Enhanced Search over an Open Diachronic Corpus
Abstract
The BVC section of the impact-es diachronic corpus of historical Spanish compiles 86 books containing approximately 2 million words. About 27% of the words, chosen to give representative coverage of the most frequent word forms, have been annotated with their lemma, part of speech, and modern equivalent following the Text Encoding Initiative guidelines. We describe how this type of annotation can be exploited to provide linguistically-enhanced search over historical documents. The advanced search supports queries whose search terms can be any combination of surface forms, lemmata, parts of speech, and modern forms of historical variants.
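As a rough illustration of the kind of query the abstract describes, the sketch below matches a sequence of search terms against annotated tokens, where each term may constrain the surface form, lemma, part of speech, or modern form. Everything in it is an assumption made for illustration: the `Token` fields, the `matches` and `search` helpers, the tag names, and the sample tokens are hypothetical and are not taken from the corpus or from the authors' search system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One corpus token; annotation fields are None when the token is unannotated."""
    form: str                      # surface form as it appears in the text
    lemma: Optional[str] = None
    pos: Optional[str] = None      # hypothetical tag names, not the corpus tagset
    modern: Optional[str] = None   # modern equivalent of a historical variant


def matches(token, constraint):
    """True if every field named in the constraint equals the token's value."""
    return all(getattr(token, field) == value for field, value in constraint.items())


def search(tokens, query):
    """Return start indices where consecutive tokens satisfy the query terms in order."""
    n = len(query)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches(tokens[i + j], query[j]) for j in range(n))
    ]


# Invented sample sentence, only partly annotated (as in the corpus, where
# roughly a quarter of the words carry annotations).
tokens = [
    Token("dixo", lemma="decir", pos="verb", modern="dijo"),
    Token("que"),
    Token("fazia", lemma="hacer", pos="verb", modern="hacía"),
    Token("frio", lemma="frío", pos="noun", modern="frío"),
]

# A term can mix annotation layers: a lemma followed by a part of speech.
hits = search(tokens, [{"lemma": "hacer"}, {"pos": "noun"}])
print(hits)
```

Because unannotated tokens carry `None` in their annotation fields, a term that constrains the lemma, part of speech, or modern form can only match annotated words, while a surface-form term can match any token.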
Similar resources
Gearing the Discursive Practice to the Evolution of Discipline: Diachronic Corpus Analysis of Stance Markers in Research Articles’ Methodology Section
Despite widespread interest and research among applied linguists to explore metadiscourse use, very little is known of how metadiscourse resources have evolved over time in response to the historically developing practices of academic communities. Motivated by such an ambition, the current research drew on a corpus of 874315 words taken from three leading journals of applied linguistics in orde...
Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC
This paper presents the Tübingen Baumbank des Deutschen Diachron (TüBa-D/DC), a linguistically annotated corpus of selected diachronic materials from the German Gutenberg Project. It was automatically annotated by a suite of NLP tools integrated into WebLicht, the linguistic chaining tool used in CLARIN-D. The annotation quality has been evaluated manually for a subcorpus ranging from Middle Hi...
Investigating Lexico-grammaticality in Academic Abstracts and Their Full Research Papers from a Diachronic Perspective
Development of science and academic knowledge has led to changes in academic language and transfer of information and knowledge. In this regard, the present study is an attempt to investigate lexico-grammaticality in academic abstracts and their full research papers in Linguistics, Chemistry and Electrical engineering papers published during 1991-2015 in academic journals from a diachronic pers...
An open diachronic corpus of historical Spanish: annotation criteria and automatic modernisation of spelling
The impact-es diachronic corpus of historical Spanish compiles over one hundred books —containing approximately 8 million words— in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open license (Creative Commons by-nc-sa) in...
Compasses, Magnets, Water Microscopes: Annotation and Analysis of Terminology in a Diachronic Corpus of Scientific Texts
The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known abo...